Chinese sentence segmentation as comma classification

نویسندگان

  • Nianwen Xue
  • Yaqin Yang
چکیده

We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detecting commas that signal sentence boundaries.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Segmentation of Chinese Long Sentences Using Commas

The comma is the most common form of punctuation. As such, it may have the greatest effect on the syntactic analysis of a sentence. As an isolate language, Chinese sentences have fewer cues for parsing. The clues for segmentation of a long Chinese sentence are even fewer. However, the average frequency of comma usage in Chinese is higher than other languages. The comma plays an important role i...

متن کامل

Maximum Entropy for Chinese Comma Classification with Rich Linguistic Features

Discourse relation is an important content of discourse semantic analysis, and the study of punctuation is of importance for discourse relation. In this paper, we propose a method of Chinese comma classification based on maximum entropy (ME). This method classifies the sentence relation based on comma with ME by extracting rich linguistic features before and after the commas in sentences. Exper...

متن کامل

Chinese Comma Disambiguation for Discourse Analysis

The Chinese comma signals the boundary of discourse units and also anchors discourse relations between adjacent text spans. In this work, we propose a discourse structureoriented classification of the comma that can be automatically extracted from the Chinese Treebank based on syntactic patterns. We then experimented with two supervised learning methods that automatically disambiguate the Chine...

متن کامل

Dependency parsing for Chinese long sentence: A second-stage main structure parsing method

This paper explores the problem of parsing Chinese long sentences. Inspired by human sentence processing, a second-stage parsing method, referred as main structure parsing in this paper, are proposed to improve the parsing performance as well as maintaining its high accuracy and efficiency on Chinese long sentences. Three different methods have attempted in this paper and the result shows that ...

متن کامل

A Segmentation Matrix Method for Chinese Segmentation Ambiguity Analysis

Chinese Segmentation Ambiguity (CSA) is a fundamental problem confronted when processing Chinese language, where a sentence can generate more than one segmentation paths. Two techniques are commonly used to identify CSA: Omni-segmentation and Bi-directional Maximum Matching (BiMM). Due to the high computational complexity, Omni-segmentation is difficult to be applied for big data. BiMM is easie...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011